ABSTRACT

LAGLIDADG homing endonucleases (LHEs) are a 
family of highly specific DNA endonucleases capable 
of recognizing target sequences 20 bp in length, 
thus drawing intense interest for their potential 
academic, biotechnological and clinical applica- 
tions. Methods for rational design of LHEs to 
cleave desired target sites are presently limited by 
a small number of high-quality native LHEs to serve 
as scaffolds for protein engineering — many are un- 
satisfactory for gene targeting applications. One 
strategy to address such limitations is to identify 
close homologs of existing LHEs possessing 
superior biophysical or catalytic properties. To test 
this concept, we searched public sequence data- 
bases to identify putative LHE open reading 
frames homologous to the LHE I-AniI and used a
DNA binding and cleavage assay using yeast 
surface display to rapidly survey a subset of the 
predicted proteins. These proteins exhibited a 
range of capacities for surface expression and 
also displayed locally altered binding and cleavage 
specificities with a range of in vivo cleavage 
activities. Of these enzymes, I-HjeMI demonstrated
the greatest activity in vivo and was readily crys- 
tallizable, allowing a comparative structural 
analysis. Taken together, our results suggest that 
even highly homologous LHEs offer a readily 
accessible resource of related scaffolds that 
display diverse biochemical properties for biotech- 
nological applications. 



INTRODUCTION 

LAGLIDADG homing endonuclease (LHE) genes are 
mobile genetic elements that code for rare cleaving DNA 
enzymes, which in turn are responsible for catalyzing their 
mobility, known as homing. The homing process relies on 
the generation of DNA double strand breaks in an allele 
lacking the LHE gene insertion, which stimulates homolo- 
gous recombination (HR) using the LHE-containing allele 
as the template (1,2). As an LHE's physiological recogni- 
tion sequence is ~20bp in length, it appears on average 
only once every ~10^12 bases. Even after accounting for an
LHE's promiscuity at individual DNA bp positions, the 
overall specificity of these enzymes appears to be at least 
approximately one in 10^9. Consequently, LHEs have
drawn attention as rare cleaving nucleases for use in 
diverse site-specific genome engineering applications, par- 
ticularly for organisms with large genomes (3-5). 
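
The once-per-~10^12-bases figure follows from treating each of the 20 positions as an independent, equally likely choice among four bases. The sketch below reproduces that arithmetic under this idealized uniform-sequence assumption (it is not a model of any real genome); the function name is illustrative only.

```python
def expected_spacing(site_length: int, alphabet_size: int = 4) -> int:
    """Average number of bases per occurrence of one exact site,
    assuming a uniform random sequence (an idealization)."""
    return alphabet_size ** site_length

# A fully specified 20-bp site: 4**20 = 1,099,511,627,776 (about 10^12).
spacing_20bp = expected_spacing(20)

# For comparison, 15 fully specified positions give 4**15 = 1,073,741,824
# (about 10^9), the order of magnitude quoted for overall specificity.
spacing_15bp = expected_spacing(15)

print(f"20 bp: {spacing_20bp:.2e}, 15 bp: {spacing_15bp:.2e}")
```

Read this way, a specificity of roughly one in 10^9 corresponds to about 15 of the 20 positions being fully specified, which is one informal way to picture tolerated promiscuity at the remaining positions.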

An important limitation to widespread application of 
LHEs in genome engineering is the requirement to 
modify a starting native LHE ('scaffold') to create 
variants of that scaffold that cleave at specific desired 
target sites. Although computational design methods 
and selection protocols for this purpose are now quite 
advanced (6-10), it remains challenging to consistently 
produce variants with high levels of in vivo activity. One 
constraint on present approaches for engineering LHEs is
their narrow application to a small set of previously
reported, well-characterized native LHE scaffolds:
I-SceI, I-CreI, I-DmoI, I-AniI and I-OnuI (11-15).
We hypothesized that, because members of this small
group were not originally identified based on specific
biotechnologically useful properties, homologous
proteins might represent a source of closely related
scaffolds that possess a desirable range of such properties.
To test this hypothesis, we searched public sequence
databases to identify open reading frames (ORFs)
encoding proteins homologous to the LHE I-AniI and
surveyed the properties of a subset of these proteins.
Individual proteins were assessed using an assay that 
relies upon yeast surface display and that reports upon 
protein folding, expression, DNA binding and cleavage 
(15,16). Each of these properties can then be assayed by 
flow cytometric analysis in high throughput, detecting 
binding or cleavage of fluorescently labeled oligonucleo- 
tides. A separate in vivo genome engineering reporter 
assay was then used to measure targeted gene modification 
activity in transfected human cells (16-18). These analyses
revealed that I-AniI's close homologs exhibit a broad
spectrum of in vitro and in vivo activities. The best-performing
enzyme in this group, I-HjeMI, was readily expressed, 
purified and crystallized, facilitating a comparative struc- 
tural analysis of the two enzyme scaffolds. These results 
delineate a robust approach for identifying related LHE 
scaffolds and illustrate the value of this approach for iden- 
tifying scaffolds with optimal biotechnological properties. 

MATERIALS AND METHODS 

Yeast surface display expression constructs and flow 
cytometric expression analysis 

The ability of an LHE to bind and cleave a broad panel of 
DNA target sequences can be readily assayed using 
enzyme constructs that are displayed on the surface of 
yeast, as described in Jarjour et al. (16). Yeast surface 
display of I-AniI homologs on EBY100 Saccharomyces
cerevisiae was achieved using the standard vector back- 
bones and methods described previously (17). Putative 
LHE ORF sequences were selected, corresponding to 
full-length I-AniI beginning three to four amino acids
before the first LAGLIDADG helix. Corresponding 
DNA sequences were synthesized and cloned into the 
pETCON2 vector (map available on addgene.org) 
between N-terminal hemagglutinin (HA) tag and 
C-terminal Myc tag coding sequences using NheI and
XbaI; clones were verified by sequencing. Accession
numbers for the protein sequences of I-AchMI, I-HjeMI, 
I-PnoMI, I-TasMIP, I-TinMIP and I-VinIP are 
AAX34413, BK008014, ABU49435, BK008015, 
BK008016 and AAB95258, respectively. Strains harboring 
these vectors were grown in media containing 2% raffin- 
ose + 0.1% glucose at 30°C for 1 day before induction in 
2% galactose for 2-3 h at 30°C and 18-26 h at 20°C. To 
measure expression levels, 10^6 cells were washed in yeast
staining buffer (YSB): 180 mM KCl, 10 mM NaCl, 0.2%
bovine serum albumin (BSA), 0.1% galactose and 10 mM
HEPES, pH 7.5. Cells were then stained with a 1:100 dilution
of FITC-conjugated αMyc antibody (ICL Labs) and a 1:250
dilution of biotinylated αHA antibody (Covance) in YSB for
30 min at 4°C. Cells were washed and counterstained with
streptavidin-PE (BD Biosciences) in YSB for 15 min at
25°C, washed again and run on a BD LSRII™ cytometer
(BD Biosciences). The output was analyzed using FlowJo
software (Tree Star) for the percentage of FITC-positive
cells when compared with an unstained population.



Immunoprecipitation and western blot of surface-released 
protein 

Approximately 250 million expressing yeast cells (induced
as above) were harvested, washed twice in 1× phosphate-buffered
saline (PBS, Thermo Scientific) and incubated for
1 h at 30°C in 1 ml of 2 mM dithiothreitol in PBS with
protease inhibitor (Complete Mini EDTA-free, Roche) to
liberate the LHEs; this is accomplished by reducing the
disulfide bond anchoring the Aga2p-LHE fusion to the
surface-expressed Aga1p protein (Supplementary Figure
S5). The release reaction was quenched with 10 mM
iodoacetamide for 10 min at 25°C to allow subsequent
immunoprecipitation. The LHE-containing supernatant
was incubated with a 1:100 dilution of monoclonal rabbit αHA
antibody (C29F4, Cell Signaling) for 1 h at 4°C and
precipitated with protein A-conjugated Sepharose (GE
Healthcare) by incubation overnight at 4°C. Samples
were treated with PNGaseF (New England BioLabs)
according to the manufacturer's protocol to remove glycosyl
residues and allow proper migration on a gel. Samples
were prepared by boiling in 1× Laemmli buffer (Bio-Rad).

Denaturing polyacrylamide gel electrophoresis and
western blot to a polyvinylidene fluoride membrane were
performed using standard protocols. The blot was stained
with a 1:1000 dilution of rabbit αHA antibody (Cell
Signaling), washed and counterstained with a 1:5000
dilution of donkey αRabbit horseradish peroxidase-conjugated
antibody (GE Healthcare) for imaging with the ECL
system using Kodak Biomax light film.

Flow cytometric cleavage assay, end-holding and 
specificity profiling 

The catalytic activity of each LHE was measured by
tethering Alexa647-fluorescent target dsOligo to the
surface-expressed LHE and measuring the decrease in
fluorescence associated with dsOligo cleavage. Biotinylated
fluorescent dsOligo is tethered to the HA epitope via an
antibody-streptavidin bridge. Approximately 5 × 10^5 cells
were first stained with a 1:250 dilution of biotinylated αHA
(Covance) and a 1:100 dilution of fluorescein isothiocyanate
(FITC)-conjugated αMyc (ICL Labs) for 30 min at 4°C in
YSB. Preconjugated streptavidin-PE:biotin-dsOligo-A647
was then bound to the yeast via the HA-biotin-streptavidin-PE
interaction. This secondary stain was performed in the same
buffer plus 400 mM KCl to allow biotin-streptavidin
conjugation while preventing the LHE from binding the
dsOligo directly. Cells were washed in the cleavage buffer:
10 mM NaCl, 113 mM K-glutamate, 0.05% BSA and 10 mM
HEPES, pH 8.2. Cells were resuspended in the cleavage buffer
and split into two wells each. Each pair of wells was
centrifuged and resuspended in cleavage buffer plus
2 mM of either MgCl2 (cleavage permissive) or CaCl2
(cleavage restrictive); fluorescence loss due to
magnesium-dependent cleavage of the dsOligo can
subsequently be measured in these otherwise identical sample
pairs. After a 20-min cleavage incubation at 37°C, cells
were pelleted and resuspended in cold secondary stain
buffer plus 4 mM EDTA to aid release of cleaved
substrate and mitigate any end-holding effects on
dsOligo-fluorophore release. In subsequent experiments,
end-holding was determined by an increased loss in
fluorescence when the fluorophore was conjugated to the plus
half of the DNA substrate compared with when it was
conjugated to the minus half during the flow cleavage
assay; in these experiments, the final high-salt wash was
not performed.

Sample fluorescence was measured on a BD LSRII™
cytometer, and the resulting data were analyzed using
FlowJo. Each sample was normalized for enzyme
concentration by applying an identical narrow FITC gate. Cells
were then controlled for initial substrate concentration by
adjusting a narrow PE gate for each non-cleaving Ca++
sample until the median A647 fluorescence intensity was
matched for all samples. Relative cleavage efficiencies
were derived for this normalized population by dividing
the median DNA-A647 fluorescence value of the Ca++
sample (no cleavage) by the corresponding median
fluorescence value of the matched Mg++ sample (reduced
fluorescence due to cleavage). Higher Ca++/Mg++ ratios
indicate more cleavage.
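
As an illustration of the ratio described above, the following sketch divides the median A647 intensity of a Ca++ sample by that of its matched Mg++ sample. It assumes the FITC and PE gates have already been applied upstream; the function name and the event intensities are invented for illustration and are not data from this study.

```python
from statistics import median

def cleavage_ratio(ca_a647, mg_a647):
    """Ca++/Mg++ ratio of median A647 intensities for one gated sample pair.

    Both inputs are per-event A647 intensities taken after the FITC and PE
    gates have been applied (assumed to happen upstream of this function).
    """
    ca_median = median(ca_a647)
    mg_median = median(mg_a647)
    if mg_median <= 0:
        raise ValueError("Mg++ median intensity must be positive")
    return ca_median / mg_median

# Invented event intensities, for illustration only.
ca_events = [980.0, 1010.0, 1000.0, 995.0, 1020.0]   # no cleavage
mg_events = [240.0, 260.0, 250.0, 255.0, 245.0]      # signal lost to cleavage

ratio = cleavage_ratio(ca_events, mg_events)
print(ratio)
```

A ratio near 1 means the Mg++ sample kept its fluorophore (little cleavage); larger ratios mean more substrate was cleaved and released.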

Specificity profiles were produced by determining
cleavage of each of the 60 possible target sequences
in which the base at one of the 20 positions was
substituted with one of the three alternate bases, as in
the original description of this assay by Jarjour et al. (17).
In these experiments, all Ca++/Mg++ ratios were normalized
to the Ca++/Mg++ ratio of the native target site.
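
The panel size of 60 comes from 20 positions times three alternate bases. A minimal sketch of enumerating such single-substitution targets and normalizing ratios to the native site is shown below; the 20-bp sequence is a placeholder chosen only to exercise the code, not a target site from this study, and all names are illustrative.

```python
BASES = "ACGT"

def one_off_variants(target: str):
    """Yield (index, new_base, variant) for every single-base substitution."""
    for i, ref in enumerate(target):
        for alt in BASES:
            if alt != ref:
                yield i, alt, target[:i] + alt + target[i + 1:]

def normalize_to_native(ratios: dict, native_ratio: float) -> dict:
    """Express each Ca++/Mg++ ratio relative to that of the native site."""
    return {key: value / native_ratio for key, value in ratios.items()}

# Placeholder 20-bp sequence (hypothetical, not an LHE target from the paper).
placeholder_target = "ACGTACGTACGTACGTACGT"
variants = list(one_off_variants(placeholder_target))
print(len(variants))  # 20 positions x 3 alternate bases
```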

Assessment of in vivo gene modification activity 

Each LHE's target site was ligated into the truncated 
green fluorescent protein (GFP) of the traffic light 
reporter (18) using annealed, phosphorylated dsOligo 
(Supplementary Figure S1a). Lentivirus containing this
construct was used to transduce HEK 293T cells at
limiting dilution to obtain a population of cells with
single-copy chromosomal integration events. Cells were
sorted against GFP and mCherry fluorescence to ensure
that the reporter started in the 'off' state. Endonuclease
expression/GFP repair template vectors were generated by
cloning each LHE from the yeast surface display vectors
into the lentiviral backbone containing the GFP repair
fragment (Supplementary Figure S1b). ORFs were ligated
in frame with a self-cleaving T2A peptide sequence, 
followed by a blue fluorescent protein, mTagBFP, to 
allow expression levels to be measured. On Day 0, 
1 × 10^5 HEK cells of each reporter cell line were plated.
On Day 1, each reporter cell line was transfected with
400 ng of LHE expression/repair plasmid with
polyethylenimine at a wt/wt ratio of 4:1 in a pH 7, 150 mM
NaCl, 5 mM HEPES buffer. Cell medium was replaced on
Day 2, and cells were allowed to accumulate conversion
events until Day 4, when they were analyzed by flow
cytometry on a BD LSRII™. Using FlowJo software,
each expressing population was defined by mTagBFP
fluorescence. GFP+ and mCherry+ statistics, representing
HR and mutagenic non-homologous end-joining events,
respectively, were tabulated for these populations.
mTagBFP positivity was determined in comparison with
non-transfected cells for each cell line; GFP and mCherry
positivity were determined in comparison with
non-expressing populations in the
transfected cells. To ensure that the non-expressing
population was truly not expressing the construct, a small
number of the highest mTagBFP-low cells were excluded
from the non-expressing population.
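
The tabulation described above can be sketched as follows: cells are first restricted to the mTagBFP-positive (expressing) population, and GFP+ and mCherry+ events are then reported as percentages of that population. The gate thresholds, field order and cell values below are hypothetical; in practice the gates are drawn in the analysis software against the control populations described in the text.

```python
def tabulate_events(cells, bfp_gate, gfp_gate, mcherry_gate):
    """Percent GFP+ (HR) and mCherry+ (mutagenic NHEJ) among mTagBFP+ cells.

    cells: iterable of (bfp, gfp, mcherry) intensity tuples.
    The three gate thresholds are assumed to have been set beforehand.
    """
    expressing = [cell for cell in cells if cell[0] > bfp_gate]
    n = len(expressing)
    if n == 0:
        return {"n_expressing": 0, "pct_gfp": 0.0, "pct_mcherry": 0.0}
    n_gfp = sum(1 for _, gfp, _ in expressing if gfp > gfp_gate)
    n_mcherry = sum(1 for _, _, mch in expressing if mch > mcherry_gate)
    return {
        "n_expressing": n,
        "pct_gfp": 100.0 * n_gfp / n,
        "pct_mcherry": 100.0 * n_mcherry / n,
    }

# Invented intensities for six cells, for illustration only.
example_cells = [
    (500, 900, 10),  # expressing, GFP+
    (600, 20, 800),  # expressing, mCherry+
    (700, 15, 12),   # expressing, no event
    (800, 10, 9),    # expressing, no event
    (50, 950, 11),   # below the BFP gate, ignored
    (40, 8, 7),      # below the BFP gate, ignored
]
summary = tabulate_events(example_cells, bfp_gate=100, gfp_gate=100, mcherry_gate=100)
print(summary)
```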

Protein expression and purification 

The I-HjeMI reading frame was ligated into a commercially
available pET15b expression plasmid (Novagen,
Inc.) that incorporates an N-terminal 6-histidine affinity
purification tag and subsequent thrombin cleavage site
prior to the endonuclease reading frame. A single point
mutation was incorporated into the I-HjeMI coding
sequence (corresponding to L232K), based on the knowledge
that a similar mutation at that position increases the
solubility of the homologous I-AniI (19). The I-AniI
construct used for parallel expression experiments under the
same conditions was as described previously (20). Both
I-HjeMI and I-AniI constructs were expressed in
Escherichia coli strain BL21-CodonPlus (DE3)-RIL
(Stratagene Inc.), using a method described previously
for automatic induction of protein expression (21).

Harvested cells were collected by centrifugation, resus- 
pended in 500 mM NaCl, 50 mM Tris-HCl, pH 8.0, 5% 
glycerol with 0.2 mM phenylmethylsulfonyl fluoride and 
benzonase and lysed by sonication. After a second
centrifugation step, the clarified cell lysate was filtered (0.45 µm pore
size), purified using a single Heparin affinity purification
chromatography step (HiTrap Heparin HP, GE 
Healthcare Life Sciences) and eluted with an increasing 
gradient of 0.5-1.0 M NaCl (Supplementary Figure S2). 
The resulting protein was exchanged into thrombin 
cleavage buffer and the N-terminal His-tag was proteo- 
lytically removed. The homing endonuclease protein was 
then purified from the thrombin cleavage products by 
incubating the sample with nickel-NTA agarose resin (to 
bind the cleaved histidine tag and linked fusion polypep- 
tide), followed by size exclusion chromatography. 

Crystallographic analysis 

The DNA oligonucleotides used for cocrystallization 
(5'-GCG CTG AGG AGG TTT CTC TGT TAA GCG 
A-3' and 5'-CGC TTA ACA GAG AAA CCT CCT CAG 
CGC T-3') were synthesized by Eurofins MWG Operon 
Inc (desalted; 50 nmol scale syntheses). The oligonucleotides
were dissolved in 10 mM Tris-EDTA buffer pH 7.8,
to a final concentration of the resulting DNA duplex of
1 mM, and the complementary DNA strands were
annealed by incubation at 95°C for 5 min and cooling to
25°C over a 2-h period. Purified I-HjeMI protein
described above was mixed with a 1.2-fold molar excess of
the DNA substrate for a final concentration of 4.5 mg/ml
protein, in the presence of 1 mM CaCl2, 400 mM NaCl
and 50 mM Tris-HCl. The protein-DNA drops were
mixed in a 1:1 volume ratio with a reservoir solution
containing 0.2 M ammonium sulfate, 0.1 M bis-Tris pH 5.5
and 25% polyethylene glycol 3350. Crystals grew within
1 week and were frozen by transfer for 1-2 min to
crystallization reservoir solution supplemented with 30% sucrose
(w/v), followed by direct submersion into liquid nitrogen.
The space group of the crystals corresponded to P2₁2₁2;
a = 181.54; b = 73.58; c = 82.0 Å. The crystals diffracted
up to ~2.5 Å resolution at the ALS beamline 5.0.2
(Lawrence Berkeley National Laboratory). Data sets
were processed using the HKL2000 software package
(22). The structure of the I-HjeMI/DNA complex was
solved by molecular replacement using the Protein Data
Bank (PDB) coordinates of the WT I-AniI/DNA
complex (PDB: 2QOJ), and was modeled using COOT
(Crystallographic Object-Oriented Toolkit) (23) and
refined using REFMAC/CCP4i (24). Due to significant
disorder displayed by one of the two independent copies
of the protein-DNA complex in the crystal asymmetric
unit (which resulted in poor refinement statistics across
the upper resolution shells), the final modeling and
refinement were carried out to 3 Å resolution. While the values
for Rwork and Rfree were still elevated at this resolution
(0.28/0.36), the quality of all other refinement metrics
and the fit of the well-ordered complex to experimental
electron density were excellent.
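
As a quick consistency check on the two cocrystallization oligonucleotides listed above, the sketch below tests whether they anneal into a 27-bp duplex with a single unpaired base at each 3' end. The sequences are copied from the text; the helper name is illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]

# Sequences as printed in the text, with the grouping spaces removed.
top = "GCG CTG AGG AGG TTT CTC TGT TAA GCG A".replace(" ", "")
bottom = "CGC TTA ACA GAG AAA CCT CCT CAG CGC T".replace(" ", "")

# Dropping each strand's 3'-terminal base should leave two 27-mers that
# are exact reverse complements of one another.
paired_top, paired_bottom = top[:-1], bottom[:-1]
is_duplex = reverse_complement(paired_top) == paired_bottom

# The unpaired 3'-terminal bases of the two strands.
overhangs = (top[-1], bottom[-1])
print(is_duplex, len(paired_top), overhangs)
```

The two unpaired bases are complementary to each other, so neighboring duplexes in a crystal could in principle pair end to end through them.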

RESULTS 

Identification of I-AniI homologs and their target sites

To identify LHE-coding sequences homologous to
I-AniI, the National Center for Biotechnology Information
'tblastn' function was used to identify multiple
putative LHEs of varying similarity in public sequence
databases. Six homologs, each identified in a different
fungal mitochondrial genome, and whose alignments
are shown in Figure 1A, were selected based on the
conservation of catalytic Mg++-coordinating residues within
the LAGLIDADG motif to increase the chances of finding
an active LHE. The introns containing these putative
LHEs also had flanking sequences differing from those of
I-AniI, suggesting slightly altered cleavage specificities
(25,26) (Figure 1B). These homologs were named according
to the conventions put forth by Roberts et al. (27);
notably, a 'P' suffix denotes a homolog of unverified
enzymatic functionality. The additional suffix 'M' was added
to avoid redundancy in nomenclature relative to previously
identified restriction or homing endonucleases, and to
also denote that the host genome that harbored the LHE
gene was mitochondrial. Figure 1C shows the putative
target sequences of the six homologs as determined by
comparison of the sequence flanking each LHE
insertion, where the I-AniI target sequence is found.

Assessment of nuclease functionality 

Yeast surface display represents a convenient method for 
characterizing putative LHEs in high throughput, as it 
provides facile access to quantitative information on 
protein folding and stability, DNA binding and 
cleavage, without the need for large scale enzymatic puri- 
fication (17,28). In this approach, the enzyme is fused to 
an inducible surface displayed protein, Aga2p, which is 
anchored to the yeast's exterior by two disulfide bonds 
(29). Transit through the ER quality control and secretory 
pathways helps ensure that only LHEs which are stably 
folded at the induction temperature (20-30°C) are 
expressed on the yeast surface; dysfunctional variants 
which do not fold correctly are retained in direct propor- 
tion to their thermal stability (30). To compare the 
properties of six of the closest I-AniI homologs identified
in Figure 1, yeast codon-optimized ORFs were
synthesized (Supplementary Table S1) and subcloned into
the pCTCON2 vector. Relative expression levels were 
Figure 1. Predicted homologs and targets. (A) The alignment of the I-AniI homologs with the residues shaded by chemical similarity. LAGLIDADG
motifs are marked by waved lines. Conserved Mg++-coordinating residues and DNA-contact rich strand-turn-strand regions are also annotated. The
homologous serine 111, a residue important for increased catalytic activity in I-AniI, is starred. The map was generated by MacVector using
Gonnet-weighted pairwise and multiple sequence alignments with residue-specific and hydrophilic penalties. Residue numbering was matched to
I-AniI, based on the first LAGLIDADG motif. (B) Schematic depicting the original host gene (black) with intron insertion (white) from which the
LHE ORF sequences were taken and the exon/intron junctions used to predict target sequences (gray). (C) Predicted targets for each homolog,
derived by comparing flanking intron/exon regions for each intronic LHE with those from I-AniI; differences therefrom are shaded.
Figure 2. LHE characterization by flow cytometry. (A) Expression of full-length protein, determined by flow cytometry in a representative experi-
summarized for five replicates (three for I-TasMIP, I-TinMIP and I-VinIP) with standard deviation plotted. (C) A western blot using antibodies
against the N-terminal epitope tag allowed visualization of full-length and truncated protein. Much of I-AchMI was expressed as a ~33 kDa protein
fragment. Only minimal full-length protein and primarily heterogeneously truncated I-TasMIP, I-TinMIP and I-VinIP products were expressed, while
I-HjeMI, I-PnoMI and I-AniI were primarily full length and in great abundance. (D) Demonstration of the gating strategy used to normalize substrate
for the flow cleavage assay. These displayed populations are already normalized for enzyme concentration by a uniform, narrow FITC (C-terminal
epitope) gate (data not shown). Equivalent amounts of tethered dsOligo across samples were selected by finding a streptavidin-PE level (rectangle) for
each sample for which all DNA-A647 median fluorescence intensities (dashed horizontal line) were equal in the Ca++ sample (blue population). This
gate was held constant for the matched pair Mg++ sample (red population), allowing quantification of magnesium-dependent loss of the
DNA-conjugated fluorophore. The left half of the plot shows the population in the rectangular PE gate from the right plot (follow arrow).
(E) Dividing the median Alexa647 fluorescence intensity of the calcium-containing sample (blue) by that of the magnesium-containing sample
(red) yields a ratio proportional to the amount of enzymatic activity for a given LHE.



assessed using staining for N-terminal hemagglutinin
and C-terminal myc epitope tags (28,30) (Figure 2A).
Three of the six homologs, I-AchMI, I-HjeMI and 
I-PnoMI, expressed full-length proteins on the yeast 
surface; the latter two very well, as determined by the 
level of C-terminal epitope tag expression (Figure 2B). 
I-TasMIP, I-TinMIP and I-VinIP surface expressed 
poorly, presumably because they were insufficiently 
stable at the 30°C induction temperature. Consistent 
with this interpretation, poor surface expression corre- 
lated with the accumulation of heterogeneously truncated 
proteins containing only the N-terminal tag, a pattern 
confirmed by western blot of the surface released protein 
(Figure 2C) and congruent with previous observations of 
surface-expressed proteins of low thermostability (31-34). 
Notably, the level of surface expression correlated with the 
level of amino acid sequence homology to I-AniI
(Supplementary Figure S3). 



The three homologs with detectable surface expression, 
I-AchMI, I-PnoMI and I-HjeMI, were further assayed for 
binding and cleavage properties using fluorescently 
labeled oligonucleotide containing the predicted target 
site. Each bound its predicted native target with an
affinity similar to that of I-AniI (Supplementary Figure S4).
Cleavage was assessed using a previously
described tethered oligonucleotide assay (17,20), depicted
in Supplementary Figure S5, wherein enzyme and
substrate levels were normalized (Figure 2D). I-HjeMI and
I-PnoMI demonstrated catalytic activity against their
putative DNA target sequences at levels comparable
with, or slightly greater than, that of I-AniI against its
target; I-AchMI showed a reduced level of activity
(Figure 2E). Each enzyme's ability to specifically cleave
its substrate was also evaluated with a solution-type 
assay following release of yeast surface expressed protein 
to validate the flow cytometry (Supplementary Figure S6). 
DNA recognition specificity profiles

The above data indicate that an appreciable fraction of
raw LHE ORFs identified in public databases by sequence
similarity possess potentially useful enzymatic activities.
To compare the biochemical properties of these enzymes
in more detail, a 'one-off' cleavage specificity profile was
determined for WT I-AniI and each of the two highly
active enzymes, I-HjeMI and I-PnoMI, using the yeast
tethered DNA cleavage assay (Figure 3). In this assay, a
panel of DNA substrates, each harboring a single base
pair mismatch relative to the LHE's physiological target,
is assessed for relative cleavability by the expressed
enzyme. This assessment revealed that, as expected,
I-HjeMI and I-PnoMI exhibit overall I-AniI-like profiles
with localized variances at positions where their predicted
target sites differ from that of I-AniI. For example,
I-HjeMI exhibited elevated specificity at position −2
compared with the other two enzymes, but reduced
specificity at −8 and, to a lesser extent, −7 and −6, while
I-PnoMI preferred a 'T' at −5, one of the two differences
in its cognate target from I-AniI. Some small idiosyncratic
differences were also observed, such as I-HjeMI preferring
a 'G' at the −5 position, despite the fact that its predicted
native target site has an 'A'.

We also assessed the potential for 'end holding', a 
property in which one DNA half-site is bound (and 
retained after cleavage) with particularly high affinity by 
the LHE when compared with the opposing half-site. This 
behavior is particularly notable for I-AniI and has been
exploited for computational design purposes (14). Similar
to I-AniI, both I-HjeMI and I-PnoMI were found to
'end-hold' the minus (or left) half of their DNA substrates
(Supplementary Figure S7). This asymmetric pattern
suggests that these homologs use a nucleotide
discrimination mechanism similar to that of I-AniI (14),
consistent with the high conservation of amino acid identity
in the protein/DNA interface among the three enzymes in
the β-sheet regions of the strand-turn-strand domains (20)
(Figure 1A).

In vivo LHE activity 

The three enzymes that exhibited detectable surface 
expression and cleavage activity (I-AchMI, I-PnoMI and 
I-HjeMI) were also assayed for their potential for 
and therefore minimal cleavage, while values above one indicate a target is cleaved more efficiently than the predicted target.
endogenous DNA targeting and genome engineering using
a reporter system in a human cell line. For this purpose, a
recently described reporter system (18) was used to
determine the relative ability of each LHE to induce mutagenic
non-homologous end-joining (NHEJ) or HR, key genome
engineering events. The reporter system comprises
two parts: a chromosome-embedded reporter and an
endonuclease expression and repair template vector
(Supplementary Figure S1). If a break is generated in
the reporter, it can be repaired by HR using the
template GFP sequence to restore a functional GFP
protein, and the cell will be green. If the break is
repaired by mutagenic NHEJ with a frameshift to the
+3 reading frame, the GFP will be read through and the
mCherry will be properly translated in frame,
producing a red cell. Nuclease expression and donor
delivery are tracked by a blue fluorescent protein linked in
translation via a T2A self-cleaving sequence.

To implement the assay, polyclonal cell lines were
generated that harbored integrated single copies of
reporters possessing each enzyme's respective target site.
Next, each of these cell lines was transfected with equal
amounts of a donor template plasmid that also drives
expression of the respective homing endonuclease. This
resulted in similar distributions and sums of nuclease
expression and repair template copy number, as assessed by
the expression (fluorescence) of a monomeric blue
fluorescent protein, mTagBFP (Figure 4A). I-AchMI exhibited
little to no in vivo activity, consistent with its poor
performance in the yeast tethered flow cleavage assay; this
may reflect either a genuinely reduced catalytic efficiency or
impaired protein folding and/or thermal stability that
limits accumulation of active enzyme in cells cultured at
37°C. For these reasons, I-AchMI should be considered a
compromised engineering scaffold for in vivo applications.

In contrast, I-PnoMI and especially I-HjeMI
demonstrated repair of the GFP reporter by HR at
frequencies much higher than native I-AniI and comparable
with the previously reported increased-activity variant
Y2-Ani (35) (Figure 4B and C). Furthermore, remarkably
high levels of mutagenic NHEJ were observed for I-HjeMI
in the traffic light reporter assay: levels ~3-fold higher
than those stimulated by Y2-Ani and I-PnoMI. Thus,
biotechnologically relevant activities appear to vary
substantially among this group of closely related proteins.

I-HjeMI crystal structure 

Based on I-HjeMI's enhanced in vitro and in vivo functional
properties, we asked whether it might possess
structural differences from I-AniI that could be identified
and correlated with its performance characteristics. Thus,
we expressed I-HjeMI in bacteria, purified it to
homogeneity and placed it into crystallization trials using a
spectrum of standard conditions. In striking contrast to
I-AniI, which we have found to be prone to chronic
aggregation that required multiple solubilizing mutations to
ameliorate, I-HjeMI was easily produced in large
quantities and remained soluble, even at a 20 mg/ml
concentration of the purified protein. The structure of the
resulting complex of I-HjeMI bound to its DNA target
was determined at 3.0 Å resolution.

Two separate copies of the I-HjeMI/DNA complex are
found within the asymmetric unit of the crystal form
described in the 'Materials and Methods' section. One of the
complexes (corresponding to protein 'chain A' and DNA
'chains B and C' in the resulting model) was extremely well
ordered, and displayed crystallographic packing contacts
in which the overhanging A and T bases of the DNA
duplexes formed a Watson-Crick matched,
pseudocontinuous helix. The density for the entire
complex (with the exception of several disordered
residues in the linker that connects the two protein
domains) was very clear, and the resulting model
Figure 4. LHE functionality in vivo. (A) Nuclease-expression histogram. Number of cells (Y-axis) of a given mTagBFP fluorescence (X-axis) are
shown to be uniform for all transfected cells (solid line) and are compared with an untransfected control (dashed line). Gates used for comparison of
loci, by event type, are shown as a percentage of the total expressing population.



displayed an excellent Ramachandran distribution and
equally outstanding correlation to the density maps
(Supplementary Figure S8). However, the second copy
of the complex (corresponding to protein chain B in the
resulting model) was very poorly ordered and displayed
an obvious clash at the ends of neighboring
crystallographically related DNA molecules that disrupted
the base pairing at both ends of the duplex. As a result,
the overall fit of the model to the second copy of the
complex was of much lower quality. The poor quality of
density across the second copy of the complex, and the
correspondingly difficult model fit to that density, prevented
the overall refinement R-factors from being reduced to
their usual acceptable values (Rwork and Rfree are
0.28 and 0.36, respectively; see Table 1). However, the
high sequence identity (85%) of I-HjeMI to the I-AniI
homing endonuclease [which has previously been solved
and refined in multiple independent space groups to high
resolution (19,20,35)] and the excellent quality of electron
density for the well-ordered complex of I-HjeMI with its
DNA target nevertheless allowed us to generate an
unambiguous comparison of the structures of the two
homologous homing endonucleases (Figure 5). The results below
are based on the analysis of only the well-ordered complex
of I-HjeMI.
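
For readers less familiar with the refinement statistics quoted above, the following is a minimal sketch of how a conventional crystallographic R-factor is computed from observed and calculated structure-factor amplitudes, and of how a test ("free") set of about 5% of reflections might be held out. The function names, the random selection scheme and the toy amplitudes are illustrative assumptions; they are not taken from the refinement software or the data used in this study.

```python
import random


def r_factor(f_obs, f_calc):
    """Conventional R-factor: sum(| |Fo| - |Fc| |) / sum(|Fo|)."""
    if len(f_obs) != len(f_calc) or not f_obs:
        raise ValueError("need two non-empty amplitude lists of equal length")
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator


def split_free_set(n_reflections, free_fraction=0.05, seed=0):
    """Flag roughly free_fraction of reflection indices as the free (test) set.

    Random selection is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    n_free = max(1, round(n_reflections * free_fraction))
    free = set(rng.sample(range(n_reflections), n_free))
    work = [i for i in range(n_reflections) if i not in free]
    return work, sorted(free)


# Toy amplitudes (hypothetical values, not from the I-HjeMI data set).
f_obs = [100.0, 80.0, 60.0, 40.0, 20.0]
f_calc = [90.0, 85.0, 50.0, 45.0, 20.0]
toy_r = r_factor(f_obs, f_calc)

# Split as many indices as there are unique reflections in Table 1.
work_idx, free_idx = split_free_set(22491)
```

Rwork would be evaluated with `r_factor` over the working reflections and Rfree over the held-out ones; a large gap between the two, as reported here, is commonly read as a sign that part of the model is poorly determined by the data.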

As expected, I-HjeMI displays an overall structure very
similar to that of I-AniI (Figure 6), except for
the few final residues of their C-termini and a short region
of extended peptide sequence (spanning residues 123-129
in I-HjeMI) that links the N-terminal and the C-terminal
domains of the two enzymes. The regions of folded
secondary structure across the two enzymes, and in particular
the two central α-helices that contain the 'LAGLIDADG'
sequence motifs, are closely superimposable [root mean
square deviation (RMSD) less than 1 Å between all
α-carbons], while the overall RMSD for all α-carbons
across the superimposed proteins is ~1.6 Å. The overall
bend angles of the DNA and the geometric values of
individual base pairs (i.e. propeller twist, roll, etc.) in the
I-HjeMI and I-AniI complexes were also very similar.
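
As an illustration of the RMSD values quoted above, the sketch below computes the root mean square deviation over paired α-carbon coordinates. It assumes the two structures have already been superimposed and that the residues are paired one-to-one; the superposition step and the software actually used for the comparison are not described in this section, and the coordinates shown are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD (in the units of the input, e.g. angstroms) over paired 3D points.

    Assumes the two coordinate sets are already superimposed and ordered so
    that coords_a[i] and coords_b[i] are equivalent alpha-carbons.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Hypothetical alpha-carbon positions; the second set is shifted 1 unit in x.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in reference]

identical_rmsd = rmsd(reference, reference)
shifted_rmsd = rmsd(reference, shifted)
```

Restricting the input to the α-carbons of the LAGLIDADG helices versus all α-carbons is what distinguishes the two values reported in the text (less than 1 Å versus ~1.6 Å).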

Of the 37 amino acid substitutions that distinguish
I-HjeMI from I-AniI, 11 are located in the N-terminal
folded domain (residues 1-110), 18 are located in the
C-terminal domain (residues 126-254) and 8 are located
in the linker that connects the two (residues 111-125). Of
those substitutions, none are located in the
LAGLIDADG helices and very few are buried in the
hydrophobic core (the exceptions being I212, I213, L215
and L235 in the core of the I-AniI C-terminal domain,
which are instead V212, V213, I216 and I235 in
I-HjeMI). The remaining amino acid differences
involve residue positions that are partially or fully
surface accessible. Four substitutions occur at
residues that are involved in DNA contacts: I55, S111,
R172 and K200 in I-AniI are instead K55, Y111, K172
and R200 in I-HjeMI. Of these substitutions, two (I55K
and S111Y) result in additional nonspecific contacts to the
DNA backbone, one (R172K) appears to have little effect
on the structural mechanism of DNA recognition, and one
(K200R) involves a side chain that appears to make
contacts to nucleotide bases in the DNA target site

Table 1. Data collection and refinement statistics

Data collection
  ALS beamline                          BL5.0.1
  Wavelength (Å)                        1.00000
  Space group                           P2₁2₁2
  Unit cell dimensions (Å)              a = 181.6, b = 73.6, c = 82.0
  Asymmetric unit content               Two complexes
  Total reflections                     85 960
  Unique reflections                    22 491
  Resolution (Å)^a                      50.00-3.00 (3.11-3.00)
  Completeness (%)^a                    98.4 (94.5)
  Redundancy^a                          3.9 (3.6)
  Rmerge^a,b                            0.046 (0.088)
  Average I/σ^a                         22.8 (13.8)

Refinement
  Rwork (%)                             0.28
  Rfree (%)^c                           0.36
  Protein residues                      504
  Nucleotides                           112
  Water molecules                       175
  Metal ions                            5 Ca2+
  RMSD bond length (Å)                  0.0103
  RMSD bond angle (°)                   1.704
  Ramachandran distribution (%)         90% preferred, 7% allowed, 3.0% outliers
  Ramachandran distribution (Copy A)    93% preferred, 6% allowed, 1% outliers
  Average B-factors (Å²)                24.5

^a Highest resolution shell values in parentheses.
^b $R_{\mathrm{merge}} = \sum_{h}\sum_{i} |I_{hi} - \langle I_{h} \rangle| / \sum_{h}\sum_{i} I_{hi}$, where $I_{hi}$ is the ith measurement of
reflection h, and $\langle I_{h} \rangle$ is the average measured intensity of reflection h.
^c $R_{\mathrm{work}}/R_{\mathrm{free}} = \sum_{h} \big| |F_{h}(o)| - |F_{h}(c)| \big| / \sum_{h} |F_{h}(o)|$, where $R_{\mathrm{free}}$ was calculated with
5% of the data excluded from refinement.




Figure 5. I-HjeMI model and electron density map. There is
high-quality density (blue mesh) in the well-ordered complex around
(A) R202 (K200 in Ani) and (B) Y113 (Y111 in Ani).

Figure 6. Structure of I-HjeMI. The solved structure of I-HjeMI
(green) is shown bound to its target DNA (gray). This structure has
been aligned to that of I-AniI (cyan), with differences highlighted in red.



(the arginine in I-HjeMI is located within H-bond distance
of −4A and −5G on one strand of the DNA target). The
substitution of a tyrosine for serine at residue 111 (S111Y,
which results in a nonspecific interaction with the DNA
backbone outside of the 22 base pair target site)
corresponds to a mutation that was previously introduced into
I-AniI during a selection experiment for improved
cleavage of its own wild-type target site (35).
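
The substitution bookkeeping in the preceding paragraph can be sketched as follows. The snippet parses labels of the form used in the text (for example S111Y), assigns each position to a domain using the I-AniI residue ranges stated above, and checks that the reported per-domain counts sum to 37. The percent-identity line assumes an ungapped alignment over 254 positions, which is an assumption made here for illustration rather than a statement of how the 85% figure was derived; all function and variable names are hypothetical.

```python
import re
from collections import Counter

# Domain boundaries in I-AniI numbering, as stated in the text.
DOMAINS = [
    ("N-terminal domain", 1, 110),
    ("linker", 111, 125),
    ("C-terminal domain", 126, 254),
]

# One-letter wild-type residue, position, one-letter replacement (e.g. "S111Y").
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitution(label):
    """Split a label such as 'K200R' into ('K', 200, 'R')."""
    match = SUBSTITUTION.match(label)
    if match is None:
        raise ValueError(f"not a substitution label: {label!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def domain_of(position):
    """Return the name of the domain whose residue range contains position."""
    for name, start, end in DOMAINS:
        if start <= position <= end:
            return name
    raise ValueError(f"position {position} is outside residues 1-254")


# The four DNA-contacting substitutions named in the text.
dna_contact_substitutions = ["I55K", "S111Y", "R172K", "K200R"]
contacts_by_domain = Counter(
    domain_of(parse_substitution(label)[1]) for label in dna_contact_substitutions
)

# Per-domain totals reported in the text.
reported_counts = {"N-terminal domain": 11, "linker": 8, "C-terminal domain": 18}
total_substitutions = sum(reported_counts.values())

# Assumption: ungapped alignment over 254 positions.
percent_identity = 100.0 * (1 - total_substitutions / 254)
```

Under that assumption the identity comes out at about 85.4%, in line with the 85% quoted earlier. Note that with these ranges S111Y falls at the first residue of the linker.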



DISCUSSION 

This study demonstrates that a survey of evolutionary
sequence space around a specific LHE is able to rapidly
identify variants with a range of desirable properties for
gene targeting applications. These enzymes displayed
varying levels of thermostability, solubility,
crystallizability and cleavage activity, and varying capacity to
induce site-specific genetic alterations when
expressed in a human cell line. These results highlight
the considerable utility of surveying evolutionary
information as a supplement to rational protein engineering of
novel LHE variants with specific properties.

The use of yeast surface display allowed multiple
properties of each LHE ORF to be rapidly assessed,
including thermal stability, binding and cleavage
activity. Importantly, enzymes that expressed well on the
yeast surface and exhibited significant activity in the tethered
cleavage assay tended to perform very well in vivo, the
notable exception being the originally described family
member, I-AniI. Additionally, by combining an initial
yeast surface display assessment with an in vivo reporter
assay, we were able to identify two new enzymes that
exhibit in vivo performance on par with or better than
Y2-AniI, an engineered variant specifically identified to
have improved cleavage properties (35). As engineering
attempts to modify an LHE's target specificity are often
associated with reductions of catalytic efficiency toward
the new target site, the availability of scaffolds with
improved in vivo performance may provide both
optimized starting points for engineering and
information on protein modifications that can be made to
improve the performance of engineered variants.
Furthermore, I-HjeMI's high solubility and crystallizability
allowed rapid structural analysis, which can be a
powerful tool when compiling many changes to a
scaffold (6).

We chose our original search parameters to include
both highly related homologs and those exhibiting
locally altered substrate specificities, with the goal that
sequence information from related scaffolds with differing
specificities would help to inform engineering of the
scaffolds to cleave new target sites. As exemplified in a
contemporaneous work by Szeto et al. (36), small
specificity-determining pockets that have been
evolutionarily selected can be elucidated by comparing
homologs at places of divergent sequence specificity, and
these changes can be grafted onto related enzymes.
In addition to revealing locally altered substrate
specificities and activities, evolutionary sequence
information may also help to inform us about the plasticity of
enzymes at certain positions. This idea is supported by
the improved specificity (versus I-AniI) that we observed
at position −2 for I-HjeMI, the improved specificity of
I-PnoMI at −8 relative to I-HjeMI and the reverse
relationship at position +5. Therefore, these locations may
be identified as particularly amenable to modification,
thus facilitating protein engineering of the scaffold group.

An interesting question raised by our results that
warrants further investigation is whether there are
distinguishing features of those proteins that did not express
well on the yeast surface or perform well in vivo, or of those that
performed exceptionally well. Strikingly, I-HjeMI and
I-PnoMI natively harbor a tyrosine at the position
analogous to 111 in I-AniI, one of two changes in Y2-AniI
identified by directed evolution to significantly enhance
the activity of the original I-AniI enzyme (28). As polar
surface residues have been found to play critical roles in
protein folding and stability (37-40), we generated
homology models of the six I-AniI homologs analyzed
here and used them to predict surface-exposed residues.
Consistent with the importance of these residues in
promoting stable folding, I-HjeMI, I-PnoMI and
I-AniI were predicted to possess fewer overall
solvent-exposed hydrophobic residues outside of the
protein-DNA interface (19, 19 and 21, respectively) than any of
I-VinIP, I-TasMIP and I-TinMIP (24, 23 and 25, respectively).
Incorporation of such analysis may allow a list of
homologous LHEs identified in public sequence
databases to be rapidly pared down to those most likely to
possess biotechnologically relevant or other positive attributes.
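
The paring-down step suggested above could be sketched as a simple filter on the predicted number of solvent-exposed hydrophobic residues outside the protein-DNA interface. The counts below are those quoted in the text, paired with the enzymes on the assumption that each parenthesized list follows the order in which the enzymes are named; the cutoff of 22 and the function name are hypothetical and chosen only to separate the two groups in this example.

```python
# Predicted solvent-exposed hydrophobic residues outside the protein-DNA
# interface, as quoted in the text (pairing by listed order is an assumption).
exposed_hydrophobics = {
    "I-HjeMI": 19,
    "I-PnoMI": 19,
    "I-AniI": 21,
    "I-VinIP": 24,
    "I-TasMIP": 23,
    "I-TinMIP": 25,
}


def shortlist(counts, max_exposed):
    """Return enzyme names with at most max_exposed exposed hydrophobic
    residues, ordered by count and then by name."""
    keep = [name for name, n in counts.items() if n <= max_exposed]
    return sorted(keep, key=lambda name: (counts[name], name))


# Hypothetical cutoff; the text does not propose a specific threshold.
candidates = shortlist(exposed_hydrophobics, max_exposed=22)
```

With this cutoff the shortlist contains I-HjeMI, I-PnoMI and I-AniI. In practice such a filter would be one criterion among several, since I-AniI itself performed poorly in vivo despite its low count.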